Machine Learning Functions

The following is a brief guide to the machine learning (ML) functions included in the Data Flow section of Model.

Clustering

Clustering is the task of grouping a set of objects so that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups.

DBSCAN

  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a data clustering algorithm. It groups together points that lie close to one another, that is, points with many nearby neighbors.
  • Use when the number of clusters is unknown; it is generally accurate in that setting. For example, it can be applied to indoor location data to estimate the number of rooms, common locations, etc.
  • Requires the minimum number of neighbors and the maximum distance to a neighbor.
  • Analyze with a scatter or bubble graph colored by cluster number. This may reveal a group of similar points that can later be filtered and analyzed further. The number of groups can itself suggest a different approach to analyzing the data. Data points that have no cluster can be treated as outliers.
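
To make the two parameters concrete, here is a minimal pure-Python sketch of the DBSCAN idea. It is an illustration only, not the product's implementation: the names `dbscan`, `region_query`, `eps`, and `min_pts` are hypothetical, and the sketch assumes that a point counts as its own neighbor.

```python
from math import dist

NOISE = -1  # label used in this sketch for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
        cluster += 1
    return labels


# Two dense groups (for example, two rooms) and one isolated reading.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The number of clusters is never passed in: it falls out of the density parameters, and the isolated point keeps the noise label.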

EMMD

  • Expectation Maximization Mixed Data (EMMD) is a clustering method based on probabilities. It supports numerical and categorical data.
  • Use with mixed data (numeric and categorical) and an unknown number of clusters. For example, grouping product color together with sales, state, and expenses may show that red is similar to yellow in 3 different states where sales and expenses are both high.
  • Requires an upper limit for the number of clusters.
  • Analyzing this kind of clustering involves several graphs and several slices. Another use is finding outliers in the dependent mixed data space.

Hierarchical clustering

  • Hierarchical clustering groups data over a variety of scales by creating a cluster tree. Clusters at one level of the tree are joined to form clusters at the next level. This allows you to choose the level, or scale, of clustering that is most appropriate for the application.
  • This algorithm also supports both numeric and categorical data.
  • Use with mixed data and small datasets. The use cases are similar to those of EMMD, but it only works with small datasets and produces more accurate results.
  • Requires the number of clusters.
  • Analyzing this kind of clustering involves several graphs and several slices.
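
The "clusters joined at the next level" idea can be sketched as bottom-up (agglomerative) merging. This is a simplified illustration under stated assumptions: it handles numeric data only, uses single linkage (closest pair of members) as the cluster distance, and the names `agglomerative` and `single_linkage` are hypothetical.

```python
from math import dist


def single_linkage(a, b):
    """Distance between two clusters: the closest pair of members."""
    return min(dist(p, q) for p in a for q in b)


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # start with every point on its own
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # one merge = one level of the tree
        del clusters[j]
    return clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (20, 20)]
clusters = agglomerative(points, n_clusters=3)
```

Every pass compares all pairs of clusters, which is one reason this family of methods is usually reserved for small datasets.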

K-means

  • K-means clustering aims to partition numeric observations into k clusters (chosen or estimated), in which each observation belongs to the cluster with the nearest mean.
  • Use for numerical data. For example, k-means can be used for geo-clustering to find someone's home and work locations from latitude and longitude alone.
  • Optionally takes the number of clusters. If it is not specified, the number of clusters is determined by the elbow method.
  • Analyze with a scatter, bubble, or map graph colored by cluster number. This may reveal a group of similar points that can later be filtered and analyzed further.
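
The "nearest mean" rule can be sketched with the classic two-step loop (assign, then update). This is a minimal illustration, not the product's code: the function names are hypothetical, the initial centroids are simply the first k points (an assumption made to keep the sketch deterministic), and the elbow method is not covered.

```python
from math import dist


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans(points, k, max_iter=100):
    """Alternate assignment and update steps; returns (centroids, labels)."""
    centroids = list(points[:k])  # naive, deterministic initialization
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = mean(members)
    return centroids, labels


# Toy coordinates: three readings near one place, three near another.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0),
          (2.0, 1.0), (9.0, 10.0), (10.0, 9.0)]
centroids, labels = kmeans(points, k=2)
```

In the home and work example, the two resulting centroids would be the estimated locations.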

PAM

  • Partitioning Around Medoids (PAM) partitions the data into k clusters "around medoids". It is a more robust version of k-means.
  • Can be used with noisy datasets (a large number of outliers, etc.). The advantage over k-means is that the cluster location is less affected by outliers.
  • Requires the number of clusters.
  • Analyze with a scatter, bubble, or map graph colored by cluster number. This may reveal a group of similar points that can later be filtered and analyzed further.
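
The robustness to outliers comes from the fact that a medoid must be an actual data point. A minimal swap-based sketch follows; the names `pam` and `total_cost` are hypothetical, and starting from the first k points is an assumption made to keep the example short.

```python
from math import dist


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k):
    """Swap-based k-medoids; returns the sorted indices of the chosen medoids."""
    medoids = list(range(k))  # naive initialization: the first k points
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for slot in range(k):
            for candidate in range(len(points)):
                if candidate in medoids:
                    continue
                # Try replacing one medoid with a non-medoid point.
                trial = medoids[:slot] + [candidate] + medoids[slot + 1:]
                cost = total_cost(points, trial)
                if cost < best:  # keep the swap only if it lowers the cost
                    medoids, best, improved = trial, cost, True
    return sorted(medoids)


# Two compact groups plus one far-away outlier at (30, 11).
points = [(0, 0), (2, 0), (1, 1), (0, 2), (2, 2),
          (10, 10), (12, 10), (11, 11), (10, 12), (12, 12),
          (30, 11)]
medoids = pam(points, k=2)
```

The second medoid stays on the central point (11, 11) even though the outlier joins that cluster, whereas a mean over the same members would be dragged toward the outlier.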

Canopy

  • Canopy clustering is a fast numeric clustering algorithm. It is often used as a preprocessing step for other clustering algorithms or to speed up clustering operations on large datasets.
  • Use when the number of clusters is unknown and the dataset is large, such as when grouping different activities based on accelerometer data.
  • Analyze with a scatter or bubble graph colored by cluster number. This may reveal a group of similar points that can later be filtered and analyzed further. The number of groups can itself suggest a different approach to analyzing the data. Data points that have no cluster can be treated as outliers.

Classifiers - Prediction

Classification is the problem of identifying which of a set of categories (labels) a new observation belongs to, based on its features and on a training set of data containing observations (or instances) whose category membership is known.

KNN

  • K-Nearest Neighbors (KNN) is a classifier for categorical values, based on numeric distance.
  • Uses numeric data as the feature vector and categorical data as labels. For example, it can be used to predict whether someone will buy a certain item based on the number of times they saw a commercial, their yearly income, and their number of previous purchases in an online store.
  • Requires the number of neighbors (K).
  • Predict on new sample data to estimate whether a potential buyer will buy.
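
The purchase example above can be sketched as a majority vote among the K closest training rows. The data values and the name `knn_predict` are made up for illustration; in practice the features would usually be scaled first so that income does not dominate the distance.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k):
    """train is a list of (feature_vector, label); returns the majority label."""
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Features: (times the commercial was seen, yearly income in thousands,
# previous purchases). All values are hypothetical.
train = [
    ((1, 30, 0), "no"),
    ((2, 35, 1), "no"),
    ((3, 32, 0), "no"),
    ((8, 80, 5), "buy"),
    ((9, 85, 6), "buy"),
    ((7, 90, 4), "buy"),
]
likely_buyer = knn_predict(train, (8, 82, 5), k=3)
unlikely_buyer = knn_predict(train, (2, 31, 0), k=3)
```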

Naïve Bayes

  • Naïve Bayes is a multiclass classification algorithm that assumes independence between every pair of features.
  • Uses categorical data as the feature vector and categorical data as labels. For example, it can be used to predict whether someone will play tennis based on Outlook (sunny/overcast/raining), Temperature (hot/mild/cold), Humidity (high/normal), and Windy (true/false).
  • Requires λ (smoothing parameter) for handling sparse data or unknown words.
  • Predict whether your father will go play tennis or stay home.
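
The role of λ is easiest to see in code: it is added to every count so that a value never seen with a given label does not force the probability to zero. The sketch below is illustrative only; the function names are hypothetical and the rows are a small made-up tennis table in the format described above.

```python
from collections import Counter, defaultdict
from math import log


def train_nb(rows, labels, lam=1.0):
    """Count label and per-feature value frequencies for categorical data."""
    label_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    vocab = defaultdict(set)             # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            vocab[i].add(value)
    return label_counts, value_counts, vocab, lam


def predict_nb(model, row):
    """Return the label with the highest smoothed log-posterior."""
    label_counts, value_counts, vocab, lam = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = log(n / total)  # prior
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            score += log((count + lam) / (n + lam * len(vocab[i])))
        scores[label] = score
    return max(scores, key=scores.get)


data = [  # (Outlook, Temperature, Humidity, Windy) -> play?
    (("sunny", "hot", "high", "false"), "no"),
    (("sunny", "hot", "high", "true"), "no"),
    (("overcast", "hot", "high", "false"), "yes"),
    (("raining", "mild", "high", "false"), "yes"),
    (("raining", "cold", "normal", "false"), "yes"),
    (("raining", "cold", "normal", "true"), "no"),
    (("overcast", "cold", "normal", "true"), "yes"),
    (("sunny", "mild", "high", "false"), "no"),
    (("sunny", "cold", "normal", "false"), "yes"),
    (("raining", "mild", "normal", "false"), "yes"),
    (("sunny", "mild", "normal", "true"), "yes"),
    (("overcast", "mild", "high", "true"), "yes"),
    (("overcast", "hot", "normal", "false"), "yes"),
    (("raining", "mild", "high", "true"), "no"),
]
model = train_nb([r for r, _ in data], [y for _, y in data], lam=1.0)
windy_sunny_day = predict_nb(model, ("sunny", "cold", "high", "true"))
calm_overcast_day = predict_nb(model, ("overcast", "mild", "normal", "false"))
```

Note that "overcast" never appears with the label "no" in this table; with λ = 0 that combination would have probability zero, which is the sparse-data problem the smoothing parameter addresses.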

 

Decision tree

  • In a decision tree (as a predictive model), each column represents a branch. To predict, follow the tree for each row from the root down. Slow but accurate.
  • Uses mixed data for branches (feature vector) and categorical data for the prediction (labels). Can be used with a small dataset to predict purchase or no purchase based on mixed data (age, education, height, home owner).
  • Predict on new sample data to estimate whether a potential buyer will buy.

 

Random forest

  • Random forests are ensembles of decision trees. They combine many small decision trees built from random samples in order to reduce the risk of overfitting.
  • Uses mixed data for the feature vector and categorical data as labels. A random forest could be used to predict whether someone will have the flu based on attributes such as smoker, diabetes, alcohol, vegetarian, lives in the city, etc.
  • Predict who should get a vaccine.

Shallow Neural Net

  • Artificial neural nets are computing systems inspired by the biological neural networks that constitute animal brains. The shallow neural net consists of one hidden layer.
  • Uses mixed data for the feature vector and categorical data as labels. Can be used for image recognition, for example to determine whether a user profile picture shows a male or a female.
  • Add a gender tag for all users and save a question in the sign-up form.

Support Vector Machine (SVM)

  • A Support Vector Machine is a classification algorithm that maps the data to points in space such that the gap between points of different categories is as wide as possible.
  • Uses numeric data for the feature vector and categorical data as labels. Can be used to predict whether someone will buy a certain item based on the number of times they saw a commercial, their yearly income, and their number of previous purchases in an online store.

Regression

Regression tree

  • A regression tree estimates a numeric value using a tree.
  • Uses mixed data for the feature vector and numeric data as the target. Can be used either for cleaning data or for estimating numeric values.
  • Requires the depth and width of the tree.
  • Estimate sales based on new data.
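
A regression tree can be sketched as repeated splitting, where each leaf predicts the mean target of the rows that reach it. The sketch is a simplification under stated assumptions: it handles numeric features only, it limits the tree by depth alone (no width limit), and the names `build_tree`, `predict`, and `sse` are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(rows, targets, max_depth):
    """Recursively split on the (feature, threshold) that minimizes SSE."""
    if max_depth == 0 or len(set(targets)) <= 1:
        return sum(targets) / len(targets)  # leaf: predict the mean target
    best = None
    for f in range(len(rows[0])):
        # Skip the smallest value so both sides of a split are non-empty.
        for t in sorted({r[f] for r in rows})[1:]:
            left = [y for r, y in zip(rows, targets) if r[f] < t]
            right = [y for r, y in zip(rows, targets) if r[f] >= t]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # no valid split: all rows share identical features
        return sum(targets) / len(targets)
    _, f, t = best
    left = [(r, y) for r, y in zip(rows, targets) if r[f] < t]
    right = [(r, y) for r, y in zip(rows, targets) if r[f] >= t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
            build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1))


def predict(tree, row):
    """Walk from the root to a leaf and return its stored mean."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] < t else right
    return tree


# Hypothetical data: one feature (advertising spend) and sales as the target.
rows = [(1,), (2,), (3,), (10,), (11,), (12,)]
sales = [10.0, 12.0, 11.0, 50.0, 52.0, 51.0]
tree = build_tree(rows, sales, max_depth=1)
low_estimate = predict(tree, (2,))
high_estimate = predict(tree, (11,))
```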

 

Interpolation

Linear interpolation

  • Linear interpolation is a mathematical method of modeling the data with a set of straight line segments, each connecting two sample data points.
  • Can be used for smoothing inaccurate data that inherently contains noise.
  • Preferable when the data behaves similarly to a set of straight lines.
  • Get smoothed values of monthly expenses data.
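
As a small worked example, the sketch below estimates a missing month from its two neighbors. The function name `linear_interpolate` and the expense figures are hypothetical.

```python
from bisect import bisect_right


def linear_interpolate(xs, ys, x):
    """Piecewise-linear value at x; xs must be sorted and strictly increasing."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the sampled range")
    i = min(bisect_right(xs, x), len(xs) - 1)  # index of the right-hand sample
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


months = [1, 2, 4, 5]                    # month 3 is missing
expenses = [100.0, 120.0, 160.0, 150.0]  # hypothetical monthly expenses
march = linear_interpolate(months, expenses, 3)
```

Month 3 lies halfway between months 2 and 4, so the estimate is halfway between 120 and 160.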

Polynomial interpolation

  • Polynomial interpolation is a mathematical method of modeling the data with a single polynomial that connects the sample data points.
  • Can be used for smoothing inaccurate data that inherently contains noise.
  • Preferable when the data behaves similarly to a polynomial.
  • Get smoothed values of monthly expenses data.
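
One common way to evaluate the interpolating polynomial is the Lagrange form, sketched below. Whether the product uses this form is not stated in this guide, so treat it as an assumption; the function name and the data are hypothetical.

```python
def lagrange_interpolate(xs, ys, x):
    """Value at x of the unique polynomial through all (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)  # basis factor: 1 at xi, 0 at xj
        total += term
    return total


months = [1, 2, 4]
expenses = [101.0, 104.0, 116.0]  # hypothetical values lying on 100 + month**2
march = lagrange_interpolate(months, expenses, 3)
```

Because the three samples lie on a quadratic, the estimate for month 3 recovers that quadratic's value (about 109, up to floating-point rounding).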

Splines

  • Spline interpolation is a mathematical method of modeling the data with a set of polynomials (splines), each connecting two data points.
  • Can be used for smoothing inaccurate data that inherently contains noise.
  • Get smoothed values of monthly expenses data.
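
The guide does not say which kind of spline is used, so the sketch below assumes a natural cubic spline (one cubic per interval, with zero curvature at both ends). The function names and the data are hypothetical.

```python
from bisect import bisect_right


def second_derivatives(xs, ys):
    """Solve the tridiagonal system of a natural cubic spline (ends are 0)."""
    n = len(xs)
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    # Forward sweep of the Thomas algorithm over the interior knots.
    for i in range(1, n - 1):
        a, b, c = h[i - 1], 2 * (h[i - 1] + h[i]), h[i]
        d = 6 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
        denom = b - a * c_prime[i - 1]
        c_prime[i] = c / denom
        d_prime[i] = (d - a * d_prime[i - 1]) / denom
    # Back substitution; m[0] and m[-1] stay 0 (the "natural" condition).
    m = [0.0] * n
    for i in range(n - 2, 0, -1):
        m[i] = d_prime[i] - c_prime[i] * m[i + 1]
    return m


def spline_interpolate(xs, ys, x):
    """Evaluate the natural cubic spline through (xs, ys) at x."""
    m = second_derivatives(xs, ys)
    i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)  # segment index
    h = xs[i + 1] - xs[i]
    left, right = xs[i + 1] - x, x - xs[i]
    return (m[i] * left ** 3 / (6 * h) + m[i + 1] * right ** 3 / (6 * h)
            + (ys[i] / h - m[i] * h / 6) * left
            + (ys[i + 1] / h - m[i + 1] * h / 6) * right)


months = [1, 2, 3, 4]
expenses = [100.0, 120.0, 110.0, 130.0]  # hypothetical monthly expenses
mid_value = spline_interpolate(months, expenses, 2.5)
```

Unlike a single high-degree polynomial, each cubic only has to fit its own interval, which tends to avoid large swings between samples.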

 
